About the project

This project is for the course Introduction to Open Data Science at the University of Helsinki, fall 2018. The learning objectives of this course are to understand the possibilities of open science, open data, and open research tools; to visualize and analyze data sets with some fairly advanced statistical methods; to master a few state-of-the-art, freely available, open software tools; to apply the principle of reproducible research and see its practical advantages; and to keep learning all of this in the near and distant future (“life-long learning”). My GitHub repository can be found at: https://github.com/ellitaimela/IODS-project


RStudio Exercise 2: Regression Analysis on Students’ Learning Data

The data analysis exercise for the second course week consisted of analysing students’ learning data from the course Introduction to Social Statistics, collected between December 2014 and January 2015. The exercise involved reading and exploring the structure of a provided dataset, showing a graphical overview and summaries of the data, and finally creating and analysing a regression model from the data.

  1. Reading and exploring the data

I retrieved the data provided for this exercise from AWS. The data, formatted as a .txt file, contained students’ learning outcomes in the course Introduction to Social Statistics held in fall 2014. The data included 166 observations of 7 variables. The variables represented the students’ gender, age, and attitude towards the course; the students’ success in exam/exercise questions related to deep, strategic, and surface learning; and the exam points granted to the students.

# Reading the data
lrn14 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", 
                    sep=",", header=TRUE)
# The data includes students' learning data from the course Introduction to Social Statistics, 
# collected during 3.12.2014 - 10.1.2015

# Exploring the structure and dimensions of the data; 166 observations of 7 variables
str(lrn14)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(lrn14)
## [1] 166   7
# Summaries of the variables
summary(lrn14)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00
  2. Creating a graphical overview of the data

I created a graphical overview of the distributions of the variables and their relationships. First, I boxplotted the whole dataset to see and compare the distributions of the variables. Second, I created a plot matrix with ggpairs to get an overall picture of the relationships between the variables. After this, I examined the distributions of the variables in more detail. I showed the distribution of students by gender with a barplot, and for the rest of the variables I examined the distributions with histograms. I also analyzed the relationships between the variables with scatter plots (qplot) and boxplots (plot).

library(ggplot2)
library(GGally)

# Plotting the distributions of the variables
boxplot(lrn14)

# Creating a plot matrix with ggpairs
ggpairs(lrn14, lower = list(combo = wrap("facethist", bins = 20)))

# 110 females, 56 males
barplot(table(lrn14$gender))

# Age distribution from 17 to 55, median 22, mean 25.51, standard deviation 7.766078
hist(lrn14$age)

# Attitude from 1 to 5, median 3.2, mean 3.14, standard deviation 0.5327656
hist(lrn14$attitude)

# Points of deep learning questions from 1 to 5, median 3.667, mean 3.680, sd 0.5541369
hist(lrn14$deep)

# Points of strategic learning questions from 1 to 5, median 3.188, mean 3.121, sd 0.7718318
hist(lrn14$stra)

# Points of surface learning questions from 1 to 5, median 2.833, mean 2.787, sd 0.5288405
hist(lrn14$surf)

# Points granted altogether from 7 to 33, median 23.00, mean 22.72, sd 5.894884
hist(lrn14$points)

# A positive correlation between attitude and points can be found
qplot(attitude, points, data = lrn14) + geom_smooth(method = "lm")

qplot(age, points, data = lrn14) + geom_smooth(method = "lm")

qplot(surf, points, data = lrn14) + geom_smooth(method = "lm")

# There is no correlation between age and attitude
qplot(age, attitude, data = lrn14) + geom_smooth(method = "lm")

# The correlation of age with points achieved in strategic and deep questions looks minimal
qplot(age, stra, data = lrn14) + geom_smooth(method = "lm")

qplot(age, deep, data = lrn14) + geom_smooth(method = "lm")

# Some negative correlation can be seen between age and points achieved in surface questions - statistical significance not analyzed
qplot(age, surf, data = lrn14) + geom_smooth(method = "lm")

# On average, men achieved marginally higher points compared to women - statistical significance not analyzed
plot(lrn14$points ~ lrn14$gender)

# On average, women achieved higher points in strategic and surface learning questions - statistical significance not analyzed
plot(lrn14$stra ~ lrn14$gender)

plot(lrn14$deep ~ lrn14$gender)

plot(lrn14$surf ~ lrn14$gender)

  3. Creating a regression model

I started building the regression model by choosing three variables - attitude, stra, and surf - as explanatory variables and fitting a regression model with exam points as the dependent variable. I chose these variables because they had the highest correlations with points, as could be seen in the ggpairs matrix.

# Fitting a linear model
model1 <- lm(points ~ attitude + stra + surf, data = lrn14)
summary(model1)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Of the explanatory variables, only attitude turned out to be statistically significant, its p-value being 1.93e-08. The p-values of stra and surf were not statistically significant, so I excluded them from further analysis. I eventually tested combinations of all variables in the dataset, but found no statistically significant explanatory variable other than attitude. Therefore, I created a new regression model with attitude as the only explanatory variable. Statistically significant (p<0.0001) estimates were found for both the intercept and attitude.

# Fitting a new model with attitude as the only explanatory variable
model2 <- lm(points ~ attitude, data = lrn14)
summary(model2)
## 
## Call:
## lm(formula = points ~ attitude, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
  4. Relationship between variables and goodness of fit

As can be seen in the summary of the regression model above (model2), exam points are positively associated with one’s attitude. This is intuitive, because a high (positive) attitude towards a specific topic typically means higher interest and motivation to learn, which typically leads one to invest more time and effort in the topic. From the summary, we can see that a one-unit increase in attitude (scale from 1 to 5) increases points by 3.5255 on average.
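The interpretation of the slope can be checked with a quick sketch using the coefficient estimates reported in the model2 summary above (11.6372 and 3.5255); this is just arithmetic on the fitted line, not a new model fit.

```r
# Fitted line of model2: points = 11.6372 + 3.5255 * attitude
intercept <- 11.6372
slope     <- 3.5255

predicted_points <- function(attitude) intercept + slope * attitude

# Predicted exam points for attitudes 3 and 4
predicted_points(3)   # 22.2137
predicted_points(4)   # 25.7392

# The difference equals the slope: a one-unit increase in attitude
# adds 3.5255 points on average
predicted_points(4) - predicted_points(3)
```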

Multiple R-squared (the coefficient of determination) represents how close the data are to the fitted regression line of the model. R-squared takes values from 0 to 1 (0 to 100 %) and describes the proportion of the variation in the target variable that can be explained by the model. The multiple R-squared of model2 is 0.1906, which means that the model explains only about 19 % of the variation in exam points.
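What R-squared measures can be illustrated with a minimal sketch on simulated data (not the lrn14 dataset): R-squared equals one minus the ratio of the residual sum of squares to the total sum of squares.

```r
# Toy example: R^2 = 1 - RSS/TSS, computed by hand and compared
# to the value lm()/summary() reports
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)    # residual sum of squares
tss <- sum((y - mean(y))^2)     # total sum of squares
r2  <- 1 - rss / tss

# Matches the Multiple R-squared reported by summary(fit)
all.equal(r2, summary(fit)$r.squared)
```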

  5. Validity of assumptions of the model

One natural assumption of the linear regression model created above is linearity. Another assumption related to linear regression models is that the errors are normally distributed, uncorrelated, and have a constant variance. To analyze whether these assumptions actually hold, we can take a closer look at the residuals of the model.

The normality of the errors can be analyzed with the Q-Q plot. As can be seen, a clear majority of the residuals follow the line. Only the residuals at very low and very high values deviate from the line. Therefore we can assume the error terms mostly to be normally distributed.

By plotting the residuals towards the fitted values of the model, we can see if there is any pattern that tells how the residuals change and deduce whether the variance of the residuals is constant. As we look at the Residuals vs. Fitted plot below, there does not seem to be a clear pattern. Therefore we can consider the assumption of a constant variance somewhat valid.

By examining the Residuals vs. Leverage plot below, we can also analyze the impact of single observations on the model. As we can see, no single observation stands out clearly, and the value of the leverages of all observations are low. Therefore we can reason that the impact of single observations on the model is not too high.

par(mfrow = c(2,2))

# Visualizing Residuals vs. Fitted values
plot(model2, which = 1)

# Visualizing a normal QQ-plot
plot(model2, which = 2)

# Visualizing Residuals vs. Leverage
plot(model2, which = 5)


RStudio Exercise 3: Logistic Regression

The data analysis exercise for the third course week consisted of analysing the relationship between students’ alcohol consumption and learning outcomes.

  1. Creating the markdown file

A new markdown file ‘chapter3.Rmd’ was created and included as a child file in the ‘index.Rmd’ file.

  2. Reading and exploring the data

I read the data formulated in the Data Wrangling part of this week’s exercise, and checked the structure, dimensions and summary of the dataset. The data describes student achievement in secondary education at two Portuguese schools. The data attributes included student grades, demographic, social and school related features, and it was collected by using school reports and questionnaires. Two datasets were provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por).

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(GGally)
library(tidyr)

# Setting the working directory
setwd("/Users/ellitaimela/Documents/Github/IODS-project/Data")
# Reading the data
alc <- read.csv("alc.csv", sep=",")
# This data describes student achievement in secondary education at two Portuguese schools. 
# The data attributes include student grades, demographic, social and school related features, 
# and it was collected by using school reports and questionnaires. Two datasets are provided 
# regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por).

# Checking the data, and the structure, dimensions and summary of the dataset
# alc
str(alc)
## 'data.frame':    382 obs. of  36 variables:
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : int  2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : int  8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : int  8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
dim(alc)
## [1] 382  36
summary(alc)
##        X          school   sex          age        address famsize  
##  Min.   :  1.00   GP:342   F:198   Min.   :15.00   R: 81   GT3:278  
##  1st Qu.: 96.25   MS: 40   M:184   1st Qu.:16.00   U:301   LE3:104  
##  Median :191.50                    Median :17.00                    
##  Mean   :191.50                    Mean   :16.59                    
##  3rd Qu.:286.75                    3rd Qu.:17.00                    
##  Max.   :382.00                    Max.   :22.00                    
##  Pstatus      Medu            Fedu             Mjob           Fjob    
##  A: 38   Min.   :0.000   Min.   :0.000   at_home : 53   at_home : 16  
##  T:344   1st Qu.:2.000   1st Qu.:2.000   health  : 33   health  : 17  
##          Median :3.000   Median :3.000   other   :138   other   :211  
##          Mean   :2.806   Mean   :2.565   services: 96   services:107  
##          3rd Qu.:4.000   3rd Qu.:4.000   teacher : 62   teacher : 31  
##          Max.   :4.000   Max.   :4.000                                
##         reason    nursery   internet    guardian     traveltime   
##  course    :140   no : 72   no : 58   father: 91   Min.   :1.000  
##  home      :110   yes:310   yes:324   mother:275   1st Qu.:1.000  
##  other     : 34                       other : 16   Median :1.000  
##  reputation: 98                                    Mean   :1.448  
##                                                    3rd Qu.:2.000  
##                                                    Max.   :4.000  
##    studytime        failures      schoolsup famsup     paid     activities
##  Min.   :1.000   Min.   :0.0000   no :331   no :144   no :205   no :181   
##  1st Qu.:1.000   1st Qu.:0.0000   yes: 51   yes:238   yes:177   yes:201   
##  Median :2.000   Median :0.0000                                           
##  Mean   :2.037   Mean   :0.2016                                           
##  3rd Qu.:2.000   3rd Qu.:0.0000                                           
##  Max.   :4.000   Max.   :3.0000                                           
##  higher    romantic      famrel         freetime        goout      
##  no : 18   no :261   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  yes:364   yes:121   1st Qu.:4.000   1st Qu.:3.00   1st Qu.:2.000  
##                      Median :4.000   Median :3.00   Median :3.000  
##                      Mean   :3.937   Mean   :3.22   Mean   :3.113  
##                      3rd Qu.:5.000   3rd Qu.:4.00   3rd Qu.:4.000  
##                      Max.   :5.000   Max.   :5.00   Max.   :5.000  
##       Dalc            Walc           health         absences   
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   : 0.0  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:3.000   1st Qu.: 1.0  
##  Median :1.000   Median :2.000   Median :4.000   Median : 3.0  
##  Mean   :1.482   Mean   :2.296   Mean   :3.573   Mean   : 4.5  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.: 6.0  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :45.0  
##        G1              G2              G3           alc_use     
##  Min.   : 2.00   Min.   : 4.00   Min.   : 0.00   Min.   :1.000  
##  1st Qu.:10.00   1st Qu.:10.00   1st Qu.:10.00   1st Qu.:1.000  
##  Median :12.00   Median :12.00   Median :12.00   Median :1.500  
##  Mean   :11.49   Mean   :11.47   Mean   :11.46   Mean   :1.889  
##  3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:2.500  
##  Max.   :18.00   Max.   :18.00   Max.   :18.00   Max.   :5.000  
##   high_use      
##  Mode :logical  
##  FALSE:268      
##  TRUE :114      
##                 
##                 
## 
  3. Hypothesis of the relationships between high/low alcohol consumption and other variables

I chose four variables to study their relationships with high/low alcohol consumption. The variables I chose were famrel (quality of family relationships), goout (going out with friends), absences (number of school absences), and G3 (final grade). According to my initial hypothesis, students who have worse relationships with their family tend to feel worse and consume more alcohol. I also hypothesized that going out with friends and the number of school absences correlate positively with alcohol consumption, and that the final grade correlates negatively with alcohol consumption.

  4. Variables’ relationships with alcohol consumption

I drew boxplots of the chosen variables’ distributions, divided into high and low alcohol consumption. As can be seen, the results are in line with my initial hypotheses. The average grades are higher for students who consume less alcohol, and students who have better relationships with their family members consume less alcohol. Students who go out with friends more or have more absences also consume more alcohol. The numerical differences in the means of the variables can also be seen below. However, statistical significance cannot be deduced from these results.

# initialize a plot of high_use and family relationships
g1 <- ggplot(alc, aes(x = high_use, y = famrel))

# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("family relationships")

# initialize a plot of high_use and going out
g2 <- ggplot(alc, aes(x = high_use, y = goout))

# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ylab("going out")

# initialize a plot of high_use and absences
g3 <- ggplot(alc, aes(x = high_use, y = absences))

# define the plot as a boxplot and draw it
g3 + geom_boxplot() + ylab("absences")

# initialize a plot of high_use and G3
g4 <- ggplot(alc, aes(x = high_use, y = G3))

# define the plot as a boxplot and draw it
g4 + geom_boxplot() + ylab("grade")

# produce summary statistics by group
alc %>% group_by(high_use) %>% summarise(count = n(), famrel = mean(famrel), goout = mean(goout), absences = mean(absences), mean_grade = mean(G3))
## # A tibble: 2 x 6
##   high_use count famrel goout absences mean_grade
##   <lgl>    <int>  <dbl> <dbl>    <dbl>      <dbl>
## 1 FALSE      268   4.00  2.85     3.71       11.7
## 2 TRUE       114   3.78  3.72     6.37       10.8
  5. Using logistic regression

I created a logistic regression model m with the boolean variable high_use as the target variable; high_use is TRUE if a student’s alcohol consumption is high. The explanatory variables are the previously chosen variables: famrel, goout, absences, and G3. According to the results (see the output of summary(m)), going out and the number of absences correlate positively with high alcohol consumption very significantly (p<0.001). Family relationships (1 = bad, 5 = good) correlate negatively with alcohol consumption (p<0.05). The relationship between alcohol consumption and the final grade G3 was not statistically significant (p = 0.24).

I combined and printed out the odds ratios and their confidence intervals. As can be seen from the printout below, the predictor goout has the widest confidence interval. The interval of G3 contains the value 1, with an odds ratio of 0.95 - therefore it is hard to determine whether G3 is positively or negatively associated with high_use. The odds ratios and confidence intervals show that goout is the predictor most strongly (positively) associated with high_use. Famrel is negatively associated with high_use (the whole confidence interval is below 1). Absences are positively associated with high_use, but only slightly, since the confidence interval lies between 1.032 and 1.125.

# find the model with glm()
m <- glm(high_use ~ famrel + goout + absences + G3, data = alc, family = "binomial")

# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ famrel + goout + absences + G3, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9389  -0.7743  -0.5392   0.9022   2.3966  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.83044    0.78397  -2.335 0.019552 *  
## famrel      -0.34483    0.13563  -2.542 0.011010 *  
## goout        0.75149    0.12033   6.245 4.23e-10 ***
## absences     0.07261    0.02174   3.340 0.000839 ***
## G3          -0.04482    0.03843  -1.166 0.243529    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 394.43  on 377  degrees of freedom
## AIC: 404.43
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept)      famrel       goout    absences          G3 
## -1.83043900 -0.34482682  0.75148543  0.07261377 -0.04481705
# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- exp(confint(m))
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR     2.5 %   97.5 %
## (Intercept) 0.1603432 0.0331851 0.724006
## famrel      0.7083430 0.5415610 0.923423
## goout       2.1201470 1.6848948 2.703401
## absences    1.0753151 1.0316028 1.124827
## G3          0.9561724 0.8866759 1.031252
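The connection between the coefficient table and the odds ratio table above is simply exponentiation: logistic regression coefficients are log-odds, so exp() turns them into odds ratios. A quick sketch using the goout estimate reported in the summary:

```r
# The goout coefficient from summary(m) is on the log-odds scale
b_goout  <- 0.75148543
or_goout <- exp(b_goout)
or_goout   # ~2.12: each one-point increase in goout roughly doubles
           # the odds of high alcohol consumption
```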
  6. Predictive power of the model

I refined my model and excluded the final grade G3 from it because the relationship with alcohol consumption was not statistically significant.

As can be seen below from the 2x2 cross tabulation, 69 of the (244 + 69 =) 313 students who were predicted to be low users were actually high users of alcohol. 24 of the (24 + 45 =) 69 students who were predicted to be high users were actually low users. Therefore (69 + 24) / 382 = 0.2434555, i.e. 24.3 percent of the individuals were classified inaccurately. The same result can be obtained by building a similar loss function as in the DataCamp exercise. As can be seen, the predictive power is not perfect, but it is higher than that of a simple, totally random guessing strategy, where the probability of guessing right between two options would be only 50 percent.

# find the model with glm()
m2 <- glm(high_use ~ famrel + goout + absences, data = alc, family = "binomial")

# print out a summary of the model
summary(m2)
## 
## Call:
## glm(formula = high_use ~ famrel + goout + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9018  -0.7731  -0.5466   0.9002   2.4180  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.38825    0.63161  -3.781 0.000156 ***
## famrel      -0.34986    0.13534  -2.585 0.009735 ** 
## goout        0.77071    0.11981   6.433 1.25e-10 ***
## absences     0.07446    0.02174   3.425 0.000615 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 395.79  on 378  degrees of freedom
## AIC: 403.79
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m2)
## (Intercept)      famrel       goout    absences 
## -2.38824753 -0.34985642  0.77071403  0.07445633
# predict() the probability of high_use
probabilities <- predict(m2, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   244   24
##    TRUE     69   45
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.63874346 0.06282723 0.70157068
##    TRUE  0.18062827 0.11780105 0.29842932
##    Sum   0.81937173 0.18062827 1.00000000
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# compute the average number of wrong predictions in the (training) data
loss_func(alc$high_use, alc$probability)
## [1] 0.2434555
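The training error reported by the loss function can be reproduced directly from the confusion matrix counts shown earlier (244, 24, 69, 45); this sketch is just that arithmetic, not a refit of the model.

```r
# Confusion matrix counts from the cross tabulation above
correct_low  <- 244  # low users predicted as low
false_high   <- 24   # low users predicted as high
false_low    <- 69   # high users predicted as low
correct_high <- 45   # high users predicted as high

n <- correct_low + false_high + false_low + correct_high  # 382 students
training_error <- (false_low + false_high) / n
training_error   # 0.2434555, matching the loss function result
```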
  7. Cross-validation

According to a 10-fold cross-validation of the model m2, the prediction error on the test sets takes values of around 0.23 to 0.25, which is smaller than the error of the model introduced in DataCamp (circa 0.26). The code and results can be seen below.

library(boot)

# K-fold cross-validation
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2513089

RStudio Exercise 4: Clustering and classification

The data analysis exercise for the fourth course week consisted of analysing crime rates in the suburbs of Boston.

  1. Creating the markdown file

A new markdown file ‘chapter4.Rmd’ was created and included as a child file in the ‘index.Rmd’ file.

  2. Loading and exploring the data

I read the Boston data, which includes housing values in the suburbs of Boston. The columns include, for example, the per capita crime rate, the proportion of non-retail business acres, and the weighted mean of distances to five Boston employment centres by town. The dimensions of the data are 506 x 14.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(GGally)
library(tidyverse)
## -- Attaching packages ---------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v tibble  1.4.2     v purrr   0.2.5
## v readr   1.1.1     v stringr 1.3.1
## v tibble  1.4.2     v forcats 0.3.0
## -- Conflicts ------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x MASS::select()  masks dplyr::select()
library(corrplot)
## corrplot 0.84 loaded
library(dplyr)

# Loading the data
data(Boston)

# Checking the data, and the structure, dimensions and summary of the dataset
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14
  3. Graphical overview of the data

I created a graphical overview of the data. As can be seen from the ggpairs matrix and the summary statistics, the distributions vary between the variables. Crim, for example, resembles a Poisson distribution, and rm a normal distribution. Many of the distributions are skewed to the left or right.

The correlations between the variables seem logical. For example, the crime rate, the proximity of industrial areas, the nitrogen oxides concentration, the high age of the buildings, the accessibility of radial highways, the property tax rate, the pupil-teacher ratio, and the lower status of the population correlate negatively with the median value of owner-occupied homes within a town. The Charles River dummy variable has the smallest absolute correlations with the other variables.

# Graphical overview of the data
library(GGally)  # needed for ggpairs, if not already loaded
pairs(Boston)

ggpairs(Boston, lower = list(combo = wrap("facethist", bins = 20)))

# calculate the correlation matrix and round it (the pipe %>% requires dplyr or magrittr)
library(dplyr)
cor_matrix <- cor(Boston) %>% round(2)

# print the correlation matrix
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
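A quick way to pick out the strongest linear relationships with the target variable medv is to sort the corresponding column of the correlation matrix. A small illustrative sketch (not part of the original analysis):

# correlations of all variables with medv, sorted from most negative to most positive
sort(cor(Boston)[, "medv"])
# lstat and ptratio show the strongest negative, rm the strongest positive correlation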
# visualize the correlation matrix
library(corrplot)  # if not already loaded
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

# Summaries of the variables
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
  3. Standardization of the dataset

I standardized the dataset with the scale function. After scaling, each variable has a mean of 0 and a standard deviation of 1, while the shape of each distribution stays the same as in the original dataset. I also created a categorical crime rate variable from the scaled crime rate, using the quantiles as the break points. The original crime rate variable crim was then dropped from the dataset. Finally, I divided the dataset into train and test sets so that 80 percent of the observations belong to the train set and 20 percent to the test set.
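The scale function subtracts each column's mean and divides by its standard deviation. A quick check of this against the formula, using the crim column as an example (illustrative only):

# scale(x) computes (x - mean(x)) / sd(x) column-wise
all.equal(as.numeric(scale(Boston$crim)),
          (Boston$crim - mean(Boston$crim)) / sd(Boston$crim))
# → TRUE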

boston_scaled <- scale(Boston)

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# summary of the scaled dataset
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
# summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
label_vector <- c("low", "med_low", "med_high", "high")

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = label_vector) 

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
#remove the origical crime rate crim and add crime
boston_scaled <- subset(boston_scaled, select = -c(crim))
boston_scaled$crime <- crime
str(boston_scaled)
## 'data.frame':    506 obs. of  14 variables:
##  $ zn     : num  0.285 -0.487 -0.487 -0.487 -0.487 ...
##  $ indus  : num  -1.287 -0.593 -0.593 -1.306 -1.306 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.144 -0.74 -0.74 -0.834 -0.834 ...
##  $ rm     : num  0.413 0.194 1.281 1.015 1.227 ...
##  $ age    : num  -0.12 0.367 -0.266 -0.809 -0.511 ...
##  $ dis    : num  0.14 0.557 0.557 1.077 1.077 ...
##  $ rad    : num  -0.982 -0.867 -0.867 -0.752 -0.752 ...
##  $ tax    : num  -0.666 -0.986 -0.986 -1.105 -1.105 ...
##  $ ptratio: num  -1.458 -0.303 -0.303 0.113 0.113 ...
##  $ black  : num  0.441 0.441 0.396 0.416 0.441 ...
##  $ lstat  : num  -1.074 -0.492 -1.208 -1.36 -1.025 ...
##  $ medv   : num  0.16 -0.101 1.323 1.182 1.486 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 1 1 1 1 1 1 2 2 2 2 ...
# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows (call set.seed() beforehand for a reproducible split)
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]
  4. Linear discriminant analysis

I fitted a linear discriminant analysis model on the train set, using the categorical crime rate as the target variable and all the other variables in the dataset as predictors. After this, I drew the LDA biplot.

# linear discriminant analysis (lda comes from the MASS package)
library(MASS)
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2574257 0.2500000 0.2301980 0.2623762 
## 
## Group means:
##                   zn      indus        chas        nox          rm
## low       0.99078418 -0.9332400 -0.12090214 -0.8877387  0.44737987
## med_low  -0.09688843 -0.2538112 -0.07742312 -0.5557500 -0.15583988
## med_high -0.38396622  0.2162019  0.15101504  0.3741765  0.01922156
## high     -0.48724019  1.0170298 -0.04947434  1.0523722 -0.47704722
##                 age        dis        rad        tax     ptratio
## low      -0.9303866  0.8637629 -0.6936501 -0.7087382 -0.46401725
## med_low  -0.3663567  0.3561123 -0.5452263 -0.4602184 -0.06955515
## med_high  0.3486205 -0.3541526 -0.4026980 -0.2873593 -0.26405405
## high      0.8280641 -0.8641007  1.6390172  1.5146914  0.78181164
##                black       lstat          medv
## low       0.37676425 -0.78255250  0.5157226850
## med_low   0.30999356 -0.14994774 -0.0004450806
## med_high  0.04264041  0.05453109  0.1265580838
## high     -0.78364648  0.91347248 -0.7050798876
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.07926381  0.779073998 -0.84005195
## indus    0.03685952 -0.414388724  0.37859343
## chas    -0.03539956 -0.009396646  0.03003258
## nox      0.43765311 -0.627615555 -1.46031089
## rm       0.01866020  0.039276973 -0.25539072
## age      0.28065791 -0.288284257 -0.03545141
## dis     -0.05564182 -0.416457712  0.24890056
## rad      3.21560231  0.860669208  0.15427479
## tax     -0.06173862  0.088399084  0.38727676
## ptratio  0.11305880  0.035946521 -0.22358384
## black   -0.10335674  0.027745100  0.11066038
## lstat    0.17292042 -0.267030100  0.21281495
## medv     0.07142894 -0.446341841 -0.20543156
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9539 0.0345 0.0116
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen=2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 2)

  5. Predicting with the LDA model

I saved the crime categories from the test set and then removed the categorical crime variable from the test data. After this, I predicted the classes of the test data with the LDA model. The cross-tabulation of the predictions against the correct classes shows that the model predicts the high crime rate category clearly more accurately than the low categories.

# save the correct classes from test data
correct_classes <- test$crime

test <- subset(test, select = -c(crime))

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
str(lda.pred)
## List of 3
##  $ class    : Factor w/ 4 levels "low","med_low",..: 2 2 2 2 2 2 2 2 3 2 ...
##  $ posterior: num [1:102, 1:4] 0.3168 0.0455 0.0406 0.1828 0.2387 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:102] "6" "10" "11" "14" ...
##   .. ..$ : chr [1:4] "low" "med_low" "med_high" "high"
##  $ x        : num [1:102, 1:3] -3.17 -1.86 -1.72 -2.32 -2.36 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:102] "6" "10" "11" "14" ...
##   .. ..$ : chr [1:3] "LD1" "LD2" "LD3"
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       11      11        1    0
##   med_low    3      20        2    0
##   med_high   0      11       21    1
##   high       0       0        0   21
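The overall prediction accuracy can also be computed directly from the predictions; with the cross-tabulation above it equals the diagonal sum divided by the test set size, (11 + 20 + 21 + 21) / 102 ≈ 0.72. A one-line sketch:

# proportion of correctly classified observations in the test set
mean(correct_classes == lda.pred$class)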
  6. K-means algorithm

I reloaded the Boston dataset, standardized it, calculated the distances between the observations, and ran the k-means algorithm with three clusters. The pairs plots, colored by cluster, show how the observations group; whether three is a suitable number of clusters can be assessed, for example, by comparing the total within-cluster sum of squares for different values of k.

# load MASS and Boston
library(MASS)
data('Boston')

boston_scaled.new <- scale(Boston)

# euclidean distance matrix
dist_eu <- dist(boston_scaled.new)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled.new, method="manhattan")

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
# k-means clustering on the standardized data (the distances above were also computed on it)
km <- kmeans(boston_scaled.new, centers = 3)

# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)

pairs(Boston[6:10], col = km$cluster)
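The number of clusters can be checked with the elbow method: run k-means for a range of k and look for the point where the total within-cluster sum of squares (WCSS) stops dropping sharply. A sketch under the assumption that boston_scaled.new is available; the seed value 123 is arbitrary:

# total WCSS for k = 1..10 (k-means uses random starting centers, so fix the seed)
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k) kmeans(boston_scaled.new, centers = k)$tot.withinss)

# plot WCSS against the number of clusters; the "elbow" suggests a suitable k
plot(1:k_max, twcss, type = "b", xlab = "number of clusters k", ylab = "total WCSS")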